Add some missing references #117

ycmin95 · 2024-10-13T14:24:12Z

Added missing references: cheng2020fully, min2021visual, jiao2023cosign, zhou2023gloss, jiao2024visual.
I plan to include more references from VIPL-SLP/awesome-sign-language-processing in future updates.

AmitMY

Thank you very much for your contribution, happy to see you join the effort :)

I left a few comments for minor changes.

For future PRs, I find it easier to review PRs that add a single paragraph or address a single topic (with a few citations). That way I can review even when I have less time, and merge some changes while waiting on revisions to others (just a preference).

AmitMY · 2024-10-15T13:05:24Z

src/references.bib

@@ -1546,6 +1546,14 @@ @article{jiang2021sign
 year = {2021}
 }

+@inproceedings{jiao2023cosign,
+  title={CoSign: Exploring co-occurrence signals in skeleton-based continuous sign language recognition},


Need to wrap terms such as CoSign in braces ({CoSign}) so that their capitalization is preserved.

Thanks for the suggestion, I have revised this reference item.
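
For illustration, a minimal sketch of what the brace-protected title might look like. Only the `title` field appears in the diff above, so the other fields of the entry are left out here, and the exact form of the committed revision is an assumption.

```bibtex
% Sketch only: the remaining fields of this entry are not shown in the diff and are omitted.
@inproceedings{jiao2023cosign,
  title = {{CoSign}: Exploring co-occurrence signals in skeleton-based continuous sign language recognition},
}
```

The inner braces around `CoSign` tell BibTeX-style processors to keep that word's capitalization as written, even when the bibliography style converts titles to sentence case.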

AmitMY · 2024-10-15T13:08:32Z

src/index.md

@@ -538,10 +538,9 @@ Though some previous works have referred to this as "sign language translation,"
 without handling the syntax and morphology of the signed language [@padden1988interaction] to create a spoken language output.
 Instead, SLR has often been used as an intermediate step during translation to produce glosses from signed language videos.

-@jiang2021sign proposed a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction.
+@jiang2021sign propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction. @jiao2023cosign explore co-occurence signals in skeleton data to better exploit the knowledge of each signal for continuous SLR.  Specifically, they use Group-specific GCN to abstract skeleton features from co-occurence signals (Body, Hand, Mouth and Hand) and introduce complementary regularization to ensure consistency between predictions based on two complementary subsets of signals. Additionally, they propose a two-stream framework to fuse static and dynamic information. The model demonstrates competitive performance cpmpared to video-to-gloss methods on the RWTH-PHOENIX-Weather-2014 [@koller2015ContinuousSLR], RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets.


To minimize the diff, and to keep the paragraph organized across more than one line, please add a new line before @jiao2023cosign (it will still render as one paragraph). I'd even propose adding a new line after the end of every sentence, to make it easier to leave comments.

(but this paragraph looks good to me otherwise!)

I have divided this paragraph into individual sentences to highlight the differences more clearly. Also, in the previous version I changed the tense of the preceding sentence from past to present (@jiang2021sign proposed -> @jiang2021sign propose); I have restored the original tense in the updated version.

Currently, it appears that the tenses in this project are not consistent and may require an overall review and correction.
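
As a rough illustration of the one-sentence-per-line layout suggested above, here is a sketch using sentences abridged from the diff (the wording of the committed version may differ). In Markdown, a single newline inside a paragraph is a soft line break, so the three source lines below still render as one paragraph.

```markdown
@jiao2023cosign explore co-occurrence signals in skeleton data to better exploit the knowledge of each signal for continuous SLR.
Specifically, they use Group-specific GCN to abstract skeleton features from co-occurrence signals.
Additionally, they propose a two-stream framework to fuse static and dynamic information.
```

With this layout, a later edit to one sentence touches only its own line in the diff, and a review comment can be attached to that sentence alone.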

AmitMY · 2024-10-15T13:11:28Z

src/index.md

@@ -587,7 +586,11 @@ For this recognition, @cui2017recurrent constructs a three-step optimization mod
 First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder
 and predict the gloss using a Connectionist Temporal Classification (CTC) [@graves2006connectionist].
 Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category,
-and use this gloss video segments encoding to optimize the sequence learning model.
+and use this gloss video segments encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional networks for continuous SLR, 


(here as well, new line before sentence)

AmitMY · 2024-10-15T13:12:18Z

src/index.md

@@ -587,7 +586,11 @@ For this recognition, @cui2017recurrent constructs a three-step optimization mod
 First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder
 and predict the gloss using a Connectionist Temporal Classification (CTC) [@graves2006connectionist].
 Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category,
-and use this gloss video segments encoding to optimize the sequence learning model.
+and use this gloss video segments encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional networks for continuous SLR, 


"a fully convolutional networks" should be "fully convolutional networks" or "a fully convolutional network"

Thanks for the suggestion, I have revised this sentence.

AmitMY · 2024-10-15T13:12:52Z

src/index.md

@@ -587,7 +586,11 @@ For this recognition, @cui2017recurrent constructs a three-step optimization mod
 First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder
 and predict the gloss using a Connectionist Temporal Classification (CTC) [@graves2006connectionist].
 Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category,
-and use this gloss video segments encoding to optimize the sequence learning model.
+and use this gloss video segments encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional networks for continuous SLR, 
+moving away from LSTM-based methods to achieve end-to-end learning. They introduce a gloss feature enhancement (GFE) module to provide additional rectified supervision and 


gloss feature enhancement should be capitalized (Gloss Feature Enhancement) because an acronym is introduced

Thanks for the suggestion, I have revised this sentence.

AmitMY · 2024-10-15T13:14:17Z

src/index.md

+and use this gloss video segments encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional networks for continuous SLR, 
+moving away from LSTM-based methods to achieve end-to-end learning. They introduce a gloss feature enhancement (GFE) module to provide additional rectified supervision and 
+accelerate the training process. @min2021visual attribute the success of iterative training to its ability to reduce overfitting. They propose visual enhancement
+constraint (VEC) and visual alignment constraint (VAC) to strengthen the visual extractor and align long- and short-term predictions, enabling LSTM-based methods to be trained in an end-to-end manner. 


"visual enhancement constraint" should be capitalized, same for "visual alignment constraint"

Thanks for the suggestion, I have capitalized them.

AmitMY · 2024-10-15T13:16:03Z

src/index.md

@@ -742,6 +745,10 @@ The model features shared representations for different modalities such as text
 on several tasks such as video-to-gloss, gloss-to-text, and video-to-text. 
 The approach allows leveraging external data such as parallel data for spoken language machine translation.

+@zhou2023gloss propose the GFSLT-VLP framework for gloss-free sign language translation, which improves SLT performance through visual-alignment pretraining. In the pretraining stage, they design a pretext task that aligns visual and textual 


better imo from

@zhou2023gloss propose the GFSLT-VLP framework for gloss-free sign language translation, which improves SLT performance through visual-alignment pretraining.

to

@zhou2023gloss propose the Gloss-Free Sign Language Translation with Visual Alignment Pretraining (GFSLT-VLP) framework, to improve SLT performance.

Thanks for the suggestion, I have revised this sentence.

AmitMY · 2024-10-15T13:17:27Z

src/index.md

@@ -792,6 +798,10 @@ and showed similar performance, with the transformer underperforming on the vali
 They experimented with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint
 either concerning the entire frame or the relevant "object" (Body, Face, and Hand).

+@jiao2024visual propose a visual alignment pre-training framework for gloss-free sign language translation. Specifically, they adopt Cosign-1s [@jiao2023cosign] to obtain skeleton features from estimated pose sequences


Cosign or CoSign?

CoSign, thanks!

AmitMY · 2024-10-15T13:18:07Z

src/index.md

@@ -792,6 +798,10 @@ and showed similar performance, with the transformer underperforming on the vali
 They experimented with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint
 either concerning the entire frame or the relevant "object" (Body, Face, and Hand).

+@jiao2024visual propose a visual alignment pre-training framework for gloss-free sign language translation. Specifically, they adopt Cosign-1s [@jiao2023cosign] to obtain skeleton features from estimated pose sequences
+and a pretrained text encoder to obtain corresponding textual features. During pretraining, these visual and textual features are aligned in a greedy manner. In the finetuning stage, they replace the shallow translation module 
+used in pretraining with a pretrained translation module. This skeleton-based approach achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural], CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily], OpenASL [@shi-etal-2022-open], and How2Sign[@dataset:duarte2020how2sign] datasets without relying on gloss annotations.


missing space after How2Sign

The space has been added.

ycmin95 · 2024-10-16T06:49:46Z

Thank you for your comprehensive feedback. I have revised relevant parts and restructured the sentences for clarity. Additionally, I’ve discovered that incorporating hyperlinks for each reference can greatly enhance the document’s usability.

AmitMY

Great job!

add missing references cheng2020fully, min2021visual, jiao2023cosign, zhou2023gloss, jiao2024visual

c2e4f89

AmitMY requested changes Oct 15, 2024

ycmin95 added 2 commits October 16, 2024 14:32

revise the updated version

35ae069

update reference links

0597290

AmitMY approved these changes Oct 16, 2024

AmitMY merged commit f06ea3a into sign-language-processing:master Oct 16, 2024
1 of 2 checks passed
